{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Quasi-binomial regression\n",
    "\n",
    "This notebook demonstrates using custom variance functions and non-binary data\n",
    "with the quasi-binomial GLM family to perform a regression analysis using\n",
    "a dependent variable that is a proportion.\n",
    "\n",
    "The notebook uses the barley leaf blotch data that has been discussed in\n",
    "several textbooks. See below for one reference:\n",
    "\n",
    "https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_glimmix_sect016.htm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import statsmodels.api as sm\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from io import StringIO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The raw data, expressed as percentages.  We will divide by 100\n",
    "to obtain proportions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw = StringIO(\n",
    "    \"\"\"0.05,0.00,1.25,2.50,5.50,1.00,5.00,5.00,17.50\n",
    "0.00,0.05,1.25,0.50,1.00,5.00,0.10,10.00,25.00\n",
    "0.00,0.05,2.50,0.01,6.00,5.00,5.00,5.00,42.50\n",
    "0.10,0.30,16.60,3.00,1.10,5.00,5.00,5.00,50.00\n",
    "0.25,0.75,2.50,2.50,2.50,5.00,50.00,25.00,37.50\n",
    "0.05,0.30,2.50,0.01,8.00,5.00,10.00,75.00,95.00\n",
    "0.50,3.00,0.00,25.00,16.50,10.00,50.00,50.00,62.50\n",
    "1.30,7.50,20.00,55.00,29.50,5.00,25.00,75.00,95.00\n",
    "1.50,1.00,37.50,5.00,20.00,50.00,50.00,75.00,95.00\n",
    "1.50,12.70,26.25,40.00,43.50,75.00,75.00,75.00,95.00\"\"\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The regression model is a two-way additive model with\n",
    "site and variety effects.  The data are a full unreplicated\n",
    "design with 10 rows (sites) and 9 columns (varieties)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(raw, header=None)\n",
    "df = df.melt()\n",
    "df[\"site\"] = 1 + np.floor(df.index / 10).astype(int)\n",
    "df[\"variety\"] = 1 + (df.index % 10)\n",
    "df = df.rename(columns={\"value\": \"blotch\"})\n",
    "df = df.drop(\"variable\", axis=1)\n",
    "df[\"blotch\"] /= 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fit the quasi-binomial regression with the standard variance\n",
    "function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model1 = sm.GLM.from_formula(\n",
    "    \"blotch ~ 0 + C(variety) + C(site)\", family=sm.families.Binomial(), data=df\n",
    ")\n",
    "result1 = model1.fit(scale=\"X2\")\n",
    "print(result1.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The plot below shows that the default variance function is\n",
    "not capturing the variance structure very well. Also note\n",
    "that the scale parameter estimate is quite small."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 1
   },
   "outputs": [],
   "source": [
    "plt.clf()\n",
    "plt.grid(True)\n",
    "plt.plot(result1.predict(linear=True), result1.resid_pearson, \"o\")\n",
    "plt.xlabel(\"Linear predictor\")\n",
    "plt.ylabel(\"Residual\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An alternative variance function is mu^2 * (1 - mu)^2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 1
   },
   "outputs": [],
   "source": [
    "class vf(sm.families.varfuncs.VarianceFunction):\n",
    "    def __call__(self, mu):\n",
    "        return mu ** 2 * (1 - mu) ** 2\n",
    "\n",
    "    def deriv(self, mu):\n",
    "        return 2 * mu - 6 * mu ** 2 + 4 * mu ** 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fit the quasi-binomial regression with the alternative variance\n",
    "function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bin = sm.families.Binomial()\n",
    "bin.variance = vf()\n",
    "model2 = sm.GLM.from_formula(\"blotch ~ 0 + C(variety) + C(site)\", family=bin, data=df)\n",
    "result2 = model2.fit(scale=\"X2\")\n",
    "print(result2.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the alternative variance function, the mean/variance relationship\n",
    "seems to capture the data well, and the estimated scale parameter is\n",
    "close to 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.clf()\n",
    "plt.grid(True)\n",
    "plt.plot(result2.predict(linear=True), result2.resid_pearson, \"o\")\n",
    "plt.xlabel(\"Linear predictor\")\n",
    "plt.ylabel(\"Residual\")"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "main_language": "python",
   "notebook_metadata_filter": "-all"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
